This notebook is one in a series in which we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry. Each analysis strategy or ‘workflow’ can be divided into different components; we recommend reading more about this in the introduction notebook. In this notebook specifically, we investigate the effect of varying the Summarization component on the outcome of the differential expression results. The three component variants are: median summarization, iPQF, and sum summarization.
The R packages and helper scripts necessary to run this notebook are listed below, and each code section can be expanded in a similar fashion. You can also download the entire notebook source code.
library(ggplot2)
library(stringi)
library(gridExtra)
library(dendextend)
library(kableExtra)
library(limma)
library(psych)
library(MSnbase) # CAREFUL! load this BEFORE tidyverse, or you will screw up the rename function.
library(tidyverse)
library(CONSTANd) # install from source: https://github.com/PDiracDelta/CONSTANd/
source('other_functions.R')
source('plotting_functions.R')
Let’s load the PSM data set and keep only PSMs with low isolation interference, and without missing values.
data.list <- readRDS('input_data.rds')
dat.l <- data.list$dat.l # data in long format
# keep spectra with (isolation interference <=30 or NA) and no missing quantification channels
dat.l <- dat.l %>% filter(isoInterOk & noNAs)
dat.l
## # A tibble: 464,224 x 19
## Mixture TechRepMixture Condition BioReplicate Run Channel Protein Peptide
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 Mixtur… 1 1 Mixture2_1 Mixt… 127C P21291 [K].gF…
## 2 Mixtur… 1 0.5 Mixture2_0.5 Mixt… 127N P21291 [K].gF…
## 3 Mixtur… 1 0.125 Mixture2_0.… Mixt… 128C P21291 [K].gF…
## 4 Mixtur… 1 0.667 Mixture2_0.… Mixt… 128N P21291 [K].gF…
## 5 Mixtur… 1 0.667 Mixture2_0.… Mixt… 129C P21291 [K].gF…
## 6 Mixtur… 1 1 Mixture2_1 Mixt… 129N P21291 [K].gF…
## 7 Mixtur… 1 0.5 Mixture2_0.5 Mixt… 130C P21291 [K].gF…
## 8 Mixtur… 1 0.125 Mixture2_0.… Mixt… 130N P21291 [K].gF…
## 9 Mixtur… 1 1 Mixture2_1 Mixt… 127C P30511 [K].wA…
## 10 Mixtur… 1 0.5 Mixture2_0.5 Mixt… 127N P30511 [K].wA…
## # … with 464,214 more rows, and 11 more variables: RT <dbl>, Charge <fct>,
## # PTM <fct>, Intensity <dbl>, TotalIntensity <dbl>, Ions.Score <int>,
## # DeltaMZ <dbl>, isoInterOk <lgl>, noNAs <lgl>, onehit.protein <lgl>,
## # shared.peptide <lgl>
After this thresholding there are 19 UPS1 proteins remaining, even though 48 were originally spiked in.
# which proteins were spiked in?
spiked.proteins <- dat.l %>% distinct(Protein) %>% filter(stri_detect(Protein, fixed='ups')) %>% pull
remove_factors(spiked.proteins)
## [1] "P02787ups" "P02788ups" "P01133ups" "P69905ups" "P01008ups"
## [6] "P02741ups" "P01375ups" "P10636-8ups" "O00762ups" "P62988ups"
## [11] "P10145ups" "P05413ups" "P00915ups" "P02753ups" "P02144ups"
## [16] "P01344ups" "P01579ups" "P00709ups" "P01127ups"
We pick the condition with dilution factor 0.5 as the reference condition of interest. Each condition is represented by two of the eight reporter Channels in each Run.
sample.info
## # A tibble: 32 x 7
## Run Run.short Channel Sample Sample.short Condition Colour
## <fct> <fct> <fct> <fct> <fct> <chr> <chr>
## 1 Mixture1_1 Mix1_1 127C Mixture1_1:127C Mix1_1:127C 0.125 black
## 2 Mixture1_1 Mix1_1 127N Mixture1_1:127N Mix1_1:127N 0.667 green
## 3 Mixture1_1 Mix1_1 128C Mixture1_1:128C Mix1_1:128C 1 red
## 4 Mixture1_1 Mix1_1 128N Mixture1_1:128N Mix1_1:128N 0.5 blue
## 5 Mixture1_1 Mix1_1 129C Mixture1_1:129C Mix1_1:129C 0.5 blue
## 6 Mixture1_1 Mix1_1 129N Mixture1_1:129N Mix1_1:129N 0.125 black
## 7 Mixture1_1 Mix1_1 130C Mixture1_1:130C Mix1_1:130C 0.667 green
## 8 Mixture1_1 Mix1_1 130N Mixture1_1:130N Mix1_1:130N 1 red
## 9 Mixture1_2 Mix1_2 127C Mixture1_2:127C Mix1_2:127C 0.125 black
## 10 Mixture1_2 Mix1_2 127N Mixture1_2:127N Mix1_2:127N 0.667 green
## # … with 22 more rows
referenceCondition
## [1] "0.5"
channelNames
## [1] "127C" "127N" "128C" "128N" "129C" "129N" "130C" "130N"
We use the default unit scale: the log2-transformed reporter ion intensities.
dat.unit.l <- dat.l %>% mutate(response=log2(Intensity)) %>% select(-Intensity)
dat.unit.l
## # A tibble: 464,224 x 19
## Mixture TechRepMixture Condition BioReplicate Run Channel Protein Peptide
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 Mixtur… 1 1 Mixture2_1 Mixt… 127C P21291 [K].gF…
## 2 Mixtur… 1 0.5 Mixture2_0.5 Mixt… 127N P21291 [K].gF…
## 3 Mixtur… 1 0.125 Mixture2_0.… Mixt… 128C P21291 [K].gF…
## 4 Mixtur… 1 0.667 Mixture2_0.… Mixt… 128N P21291 [K].gF…
## 5 Mixtur… 1 0.667 Mixture2_0.… Mixt… 129C P21291 [K].gF…
## 6 Mixtur… 1 1 Mixture2_1 Mixt… 129N P21291 [K].gF…
## 7 Mixtur… 1 0.5 Mixture2_0.5 Mixt… 130C P21291 [K].gF…
## 8 Mixtur… 1 0.125 Mixture2_0.… Mixt… 130N P21291 [K].gF…
## 9 Mixtur… 1 1 Mixture2_1 Mixt… 127C P30511 [K].wA…
## 10 Mixtur… 1 0.5 Mixture2_0.5 Mixt… 127N P30511 [K].wA…
## # … with 464,214 more rows, and 11 more variables: RT <dbl>, Charge <fct>,
## # PTM <fct>, TotalIntensity <dbl>, Ions.Score <int>, DeltaMZ <dbl>,
## # isoInterOk <lgl>, noNAs <lgl>, onehit.protein <lgl>, shared.peptide <lgl>,
## # response <dbl>
Since CONSTANd needs to be applied to matrix-like data, let’s switch to wide format. (Strictly speaking, this is semi-wide, since the Channel columns still contain contributions from all Runs, but that’s OK because in the next step we split by Run.)
# switch to wide format
dat.unit.w <- pivot_wider(data = dat.unit.l, id_cols=-one_of(c('Condition', 'BioReplicate')), names_from=Channel, values_from=response)
dat.unit.w
## # A tibble: 58,028 x 23
## Mixture TechRepMixture Run Protein Peptide RT Charge PTM
## <fct> <fct> <fct> <fct> <fct> <dbl> <fct> <fct>
## 1 Mixtur… 1 Mixt… P21291 [K].gF… 145. 2 N-Te…
## 2 Mixtur… 1 Mixt… P30511 [K].wA… 137. 2 N-Te…
## 3 Mixtur… 1 Mixt… P02787… [K].sA… 149. 3 N-Te…
## 4 Mixtur… 1 Mixt… Q9UBQ5 [K].iD… 202. 2 N-Te…
## 5 Mixtur… 1 Mixt… Q5H9R7 [R].tG… 107. 2 N-Te…
## 6 Mixtur… 1 Mixt… P43490 [K].nA… 99.8 3 N-Te…
## 7 Mixtur… 1 Mixt… P43897 [R].fE… 113. 2 N-Te…
## 8 Mixtur… 1 Mixt… Q6XQN6 [R].lQ… 190. 3 N-Te…
## 9 Mixtur… 1 Mixt… P49327 [R].dL… 215. 3 N-Te…
## 10 Mixtur… 1 Mixt… Q02790 [K].sN… 58.8 2 N-Te…
## # … with 58,018 more rows, and 15 more variables: TotalIntensity <dbl>,
## # Ions.Score <int>, DeltaMZ <dbl>, isoInterOk <lgl>, noNAs <lgl>,
## # onehit.protein <lgl>, shared.peptide <lgl>, `127C` <dbl>, `127N` <dbl>,
## # `128C` <dbl>, `128N` <dbl>, `129C` <dbl>, `129N` <dbl>, `130C` <dbl>,
## # `130N` <dbl>
Now let’s apply CONSTANd to each Run separately, and then combine the results into a semi-wide dataframe again.
# split by Run so CONSTANd can be applied to each Run's quantification matrix separately
x.split <- split(dat.unit.w, dat.unit.w$Run)
x.split.norm <- lapply(x.split, function(y) {
y[,channelNames] <- CONSTANd(y[,channelNames])$normalized_data
return(y)
})
dat.norm.w <- bind_rows(x.split.norm)
dat.norm.w
## # A tibble: 58,028 x 23
## Mixture TechRepMixture Run Protein Peptide RT Charge PTM
## <fct> <fct> <fct> <fct> <fct> <dbl> <fct> <fct>
## 1 Mixtur… 1 Mixt… Q86X55 [R].lL… 142. 2 N-Te…
## 2 Mixtur… 1 Mixt… P14868 [R].vT… 160. 3 N-Te…
## 3 Mixtur… 1 Mixt… P49736 [R].gL… 177. 3 N-Te…
## 4 Mixtur… 1 Mixt… P21333 [K].yG… 150. 2 N-Te…
## 5 Mixtur… 1 Mixt… O14745 [R].sV… 94.6 2 N-Te…
## 6 Mixtur… 1 Mixt… Q9H7Z7 [K].pN… 184. 3 N-Te…
## 7 Mixtur… 1 Mixt… Q99873 [K].wL… 190. 2 N-Te…
## 8 Mixtur… 1 Mixt… Q08379 [R].lE… 78.3 2 N-Te…
## 9 Mixtur… 1 Mixt… P62847 [K].tT… 195. 2 N-Te…
## 10 Mixtur… 1 Mixt… Q9UG63 [K].lV… 131. 2 N-Te…
## # … with 58,018 more rows, and 15 more variables: TotalIntensity <dbl>,
## # Ions.Score <int>, DeltaMZ <dbl>, isoInterOk <lgl>, noNAs <lgl>,
## # onehit.protein <lgl>, shared.peptide <lgl>, `127C` <dbl>, `127N` <dbl>,
## # `128C` <dbl>, `128N` <dbl>, `129C` <dbl>, `129N` <dbl>, `130C` <dbl>,
## # `130N` <dbl>
These normalized quantification values are now directly comparable across samples and runs.
Now, let’s look at three ways to summarize quantification values from PSM to peptide (first step) to protein (second step).
dat.norm.summ.w <- emptyList(variant.names)
dat.nonnorm.summ.w <- emptyList(variant.names)
Median summarization is simple: within each Run and within each Channel, we replace multiple observations by their median. First for each Peptide (median of the PSM values), then for each Protein (median of the peptide values).
median_summarization <- function(dat) {
# group by (run,)protein,peptide then summarize twice (once on each level)
# add select() statement because summarise_at is going bananas over character columns
return(dat %>%
  group_by(Run, Protein, Peptide) %>%
  select(Run, Protein, Peptide, all_of(channelNames)) %>%
  summarise_at(.vars = channelNames, .funs = median) %>%
  select(Run, Protein, all_of(channelNames)) %>%
  summarise_at(.vars = channelNames, .funs = median) %>%
  ungroup())
}
# normalized data
dat.norm.summ.w$median <- median_summarization(dat.norm.w)
dat.norm.summ.w$median[, channelNames]
## # A tibble: 12,175 x 8
## `127C` `127N` `128C` `128N` `129C` `129N` `130C` `130N`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.999 1.00 0.996 1.01 1.00 0.999 0.998 1.00
## 2 0.993 1.01 1.01 1.00 0.997 1.01 1.00 0.993
## 3 0.997 0.997 1.00 1.00 1.00 0.996 0.997 1.00
## 4 1.01 0.999 0.979 1.00 1.01 1.01 0.989 1.00
## 5 1.05 1.00 1.05 0.992 0.949 0.995 1.01 0.956
## 6 1.01 0.992 0.993 1.01 0.995 0.998 1.00 1.00
## 7 0.995 0.999 1.00 0.993 0.998 1.01 0.998 1.00
## 8 0.999 0.986 0.994 0.989 0.974 1.01 1.03 1.02
## 9 1.00 0.992 0.998 0.983 1.00 1.00 1.01 1.01
## 10 0.995 1.00 1.00 1.00 0.999 0.998 1.00 1.00
## # … with 12,165 more rows
head(rowSums(dat.norm.summ.w$median[, channelNames]))
## [1] 8.005836 8.010511 8.000000 8.000000 8.000000 8.000000
Notice that the row sums no longer equal the number of columns (8): median summarization does not preserve row sums, whereas mean summarization does.
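A toy sketch (made-up numbers, not the actual data) of why mean summarization preserves row sums while median summarization generally does not:

```r
# three 'PSM' rows of one protein across four channels, each with row sum 4
# (i.e. row mean 1, as CONSTANd enforces)
m <- rbind(c(0.5, 1.5, 1.0, 1.0),
           c(1.4, 0.8, 0.9, 0.9),
           c(1.0, 1.0, 1.2, 0.8))
rowSums(m)                 # all equal to 4
sum(colMeans(m))           # mean summarization: row sum still 4
sum(apply(m, 2, median))   # median summarization: 3.9, no longer 4
```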
Let’s also summarize the non-normalized data for comparison later on.
# non-normalized data
dat.nonnorm.summ.w$median <- median_summarization(dat.unit.w)
iPQF (isobaric Protein Quantification based on Features) summarizes PSMs directly to the protein level in a single, weighted step: each spectrum contributes according to peptide and spectrum characteristics such as charge state, modification state, ion score, mass-to-charge deviation and precursor intensity (Fischer & Renard, 2016). We use the implementation available through MSnbase’s combineFeatures, which requires building an MSnSet object first.
iPQF_summarization <- function(x) {
dat <- split(x, x$Run) # apply iPQF to each Run separately
# first make an MSnSet object
exprs <- lapply(dat, function(y) as.matrix(y[,channelNames]))
fdata <- lapply(dat, function(y) {
  as.data.frame(y %>% select(-all_of(channelNames))) %>%
    rename(sequence='Peptide', accession='Protein', charge='Charge',
           modifications='PTM', mass_to_charge='DeltaMZ',
           search_engine_score='Ions.Score', Intensity='TotalIntensity')
})
mss <- emptyList(names(dat))
for (i in seq_along(mss)) { mss[[i]] <- MSnSet(exprs = exprs[[i]], fData = fdata[[i]]) }
# then summarize
mss.norm <- lapply(mss, function(y) combineFeatures(y, method='iPQF', groupBy = fData(y)$accession, ratio.calc='none'))
mss.norm.tibble <- lapply(mss.norm, function(y) tibble(cbind(data.frame(Run=fData(y)$Run), rownames_to_column(as.data.frame(exprs(y)), var='Protein'))))
return(bind_rows(mss.norm.tibble))
}
# normalized data
dat.norm.summ.w$iPQF <- iPQF_summarization(dat.norm.w)
# non-normalized data
dat.nonnorm.summ.w$iPQF <- iPQF_summarization(dat.unit.w)
Sum summarization is analogous to the median summarization, except that at the protein level we sum the peptide values instead of taking their median. Note that sum summarization is NOT equivalent to mean summarization: rows containing NA values have been removed, but there may still be multiple PSMs per peptide and multiple peptides per protein. Since we know there is a strong peptide-run interaction, summing values across peptides per protein may introduce a strong bias by run.
sum_summarization <- function(dat) {
# group by (run,)protein,peptide then summarize twice (once on each level)
# add select() statement because summarise_at is going bananas over character columns
return(dat %>%
  group_by(Run, Protein, Peptide) %>%
  select(Run, Protein, Peptide, all_of(channelNames)) %>%
  summarise_at(.vars = channelNames, .funs = median) %>%
  select(Run, Protein, all_of(channelNames)) %>%
  summarise_at(.vars = channelNames, .funs = sum) %>%
  ungroup())
}
# normalized data
dat.norm.summ.w$sum <- sum_summarization(dat.norm.w)
# non-normalized data
dat.nonnorm.summ.w$sum <- sum_summarization(dat.unit.w)
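To see why summing across peptides can introduce a run bias, consider a hypothetical protein whose peptide coverage differs between runs (made-up values):

```r
# peptide-level values of one protein; run 2 misses one peptide,
# as peptide detection varies strongly between runs
run1 <- c(1.0, 1.1, 0.9)
run2 <- c(1.0, 1.1)
c(sum(run1), sum(run2))        # 3.0 vs 2.1: sums differ purely due to coverage
c(median(run1), median(run2))  # 1.0 vs 1.05: medians remain comparable
```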
Before getting to the DEA section, let’s do some basic quality control and take a sneak peek at the differences between the component variants we’ve chosen. First, however, we should make the data completely wide, so that each sample gets its own unique column.
# make data completely wide (also across runs)
## normalized data
dat.norm.summ.w2 <- lapply(dat.norm.summ.w, function(x) {
return(x %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}"))
})
# inspect the new sample columns of e.g. the median variant
colnames(dat.norm.summ.w2$median)
## non-normalized data
dat.nonnorm.summ.w2 <- lapply(dat.nonnorm.summ.w, function(x){
return(x %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}") )
})
These boxplots show that the distributions of the Median- and iPQF-summarized values are very similar and symmetrical, whereas the Sum summarization produces very skewed distributions. After normalization, the distributions are centered on a common value and much more comparable across samples.
# use (half-)wide format
for (i in 1: n.comp.variants){
par(mfrow=c(1,2))
boxplot_w(dat.nonnorm.summ.w[[i]],sample.info, paste('Raw', variant.names[i], sep='_'))
boxplot_w(dat.norm.summ.w[[i]], sample.info, paste('Normalized', variant.names[i], sep='_'))
par(mfrow=c(1,1))
}
We now make MA plots of two single samples taken from condition 1 and condition 0.125, measured in different MS runs (samples Mixture2_1:127C and Mixture1_2:129N, respectively). Clearly, the normalization had a strong variance-reducing effect on the fold changes. However, the Sum summarization counteracts this effect.
# different unit variants require different computation of fold changes and average abundance: additive or multiplicative scale; see maplot_ils function
for (i in 1: n.comp.variants){
p1 <- maplot_ils(dat.nonnorm.summ.w2[[i]], 'Mixture2_1:127C', 'Mixture1_2:129N', scale.vec[i], paste('Raw', variant.names[i], sep='_'))
p2 <- maplot_ils(dat.norm.summ.w2[[i]], 'Mixture2_1:127C', 'Mixture1_2:129N', scale.vec[i], paste('Normalized', variant.names[i], sep='_'))
grid.arrange(p1, p2, ncol=2)
}
To increase the robustness of these results, let’s make some more MA plots, but now for all samples from condition 1 and condition 0.125 (quantification values averaged within each condition). Indeed, even the raw, unnormalized data now show less variability, and the normalized data even more so. In this case, even the Sum-summarized data experienced variance reduction, although not as much.
# different unit variants require different computation of fold changes and average abundance: additive or multiplicative scale; see maplot_ils function
channels.num <- sample.info %>% filter(Condition=='1') %>% select(Sample) %>% pull
channels.denom <- sample.info %>% filter(Condition=='0.125') %>% select(Sample) %>% pull
for (i in 1: n.comp.variants){
p1 <- maplot_ils(dat.nonnorm.summ.w2[[i]], channels.num, channels.denom, scale=scale.vec[i], paste('Raw', variant.names[i], sep='_'))
p2 <- maplot_ils(dat.norm.summ.w2[[i]], channels.num, channels.denom, scale=scale.vec[i], paste('Normalized', variant.names[i], sep='_'))
grid.arrange(p1, p2, ncol=2)
}
A more proper way to look at the variability within conditions is to make Coefficient of Variation (CV) plots. Again, the distributions of the Median- and iPQF-summarized values are very similar, and their mean CV is much lower after normalization! This time, they are also quite skewed. The Sum-summarized data are also skewed, though to a lesser extent. Remarkably, their mean CV is not lower after normalization, which does not seem in line with the last MA plots we made. This need not be a problem: a dense scatter plot is hard to interpret, and we may have drawn a false conclusion earlier.
dat.nonnorm.summ.l <- lapply(dat.nonnorm.summ.w, to_long_format, sample.info)
dat.norm.summ.l <- lapply(dat.norm.summ.w, to_long_format, sample.info)
par(mfrow=c(1, 2))
for (i in 1: n.comp.variants){
cvplot_ils(dat=dat.nonnorm.summ.l[[i]], feature.group='Protein', xaxis.group='Condition',
title=paste('Raw', variant.names[i], sep='_'), abs=F)
cvplot_ils(dat=dat.norm.summ.l[[i]], feature.group='Protein', xaxis.group='Condition',
title=paste('Normalized', variant.names[i], sep='_'), abs=F)
}
par(mfrow=c(1, 1))
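As a reminder of what is being plotted: the CV of a set of quantification values is simply the standard deviation divided by the mean. A minimal sketch on made-up values:

```r
# coefficient of variation of one protein's values within one condition
x <- c(0.9, 1.0, 1.1, 1.0)
sd(x) / mean(x)   # ~0.082
```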
Now, let’s check whether these multi-dimensional data contain some kind of grouping; it’s time to make PCA plots! Even though PC1 does seem to capture the conditions, providing a gradient for the dilution number, only the 0.125 condition is completely separable in the normalized data. Here, too, the Sum summarization spoils things: its samples group just like those of the raw data, rather than by condition.
for (i in seq_along(dat.norm.summ.w2)){
par(mfrow=c(1, 2))
pcaplot_ils(dat.nonnorm.summ.w2[[i]] %>% select(-'Protein'), info=sample.info, paste('Raw', variant.names[i], sep='_'))
pcaplot_ils(dat.norm.summ.w2[[i]] %>% select(-'Protein'), info=sample.info, paste('Normalized', variant.names[i], sep='_'))
par(mfrow=c(1, 1))
}
Only 19 proteins in this data set are supposed to be differentially expressed, which is a very small proportion.
Therefore, let’s see what the PCA plots look like if we use only the spiked proteins in the PCA. Now there are clear differences between the raw, non-normalized median and iPQF plots, but after normalization they are very similar! This time, the separation between the different conditions has become more distinct, which suggests the experiment was carried out successfully. Again, the Sum summarization destroys whatever good work the normalization had done. Notice how, for all PCA plots, the percentage of variance explained by PC1 is now much greater than when using data from all proteins.
for (i in seq_along(dat.norm.summ.w2)){
par(mfrow=c(1, 2))
pcaplot_ils(dat.nonnorm.summ.w2[[i]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('Raw', variant.names[i], sep='_'))
pcaplot_ils(dat.norm.summ.w2[[i]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('Normalized', variant.names[i], sep='_'))
par(mfrow=c(1, 1))
}
Note that in a real situation without spiked proteins, you might plot data corresponding to the top X most differential proteins instead.
The PCA plots of all proteins have a rather low fraction of variance explained by PC1. We can confirm this using a hierarchical clustering dendrogram: when considering the entire multidimensional space, the different conditions are not very separable at all. This is not surprising, as only 19 of the thousands of quantified proteins are truly differential.
for (i in seq_along(dat.norm.summ.w2)){
par(mfrow=c(1, 2))
dendrogram_ils(dat.nonnorm.summ.w2[[i]] %>% select(-Protein), info=sample.info, paste('Raw', variant.names[i], sep='_'))
dendrogram_ils(dat.norm.summ.w2[[i]] %>% select(-Protein), info=sample.info, paste('Normalized', variant.names[i], sep='_'))
par(mfrow=c(1, 1))
}
We look at the log2 fold changes of all conditions w.r.t. the reference condition with dilution ratio 0.5. Since we are working on a log2 unit scale already, this means that for each protein we just look at the difference in mean observation across all channels between one condition and the reference condition. Note that this is not the same as looking at the log2 of the ratio of mean raw intensities per condition, nor at the log2 of the mean ratio of raw intensities, since \(mean(log_2 (B)) - mean(log_2 (A)) \neq log_2 (\frac{mean(B)}{mean(A)}) \neq log_2 (mean(\frac{B}{A}))\).
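A quick numerical check (made-up intensities A and B) that the three candidate definitions indeed differ:

```r
A <- c(100, 200, 400)
B <- c(150, 350, 900)
mean(log2(B)) - mean(log2(A))  # difference of mean log2 values (what we use)
log2(mean(B) / mean(A))        # log2 of the ratio of mean intensities
log2(mean(B / A))              # log2 of the mean intensity ratio
```

The three numbers (roughly 0.85, 1 and 0.87) are all different.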
To check whether these fold changes are significant (criterion: \(q<0.05\)), we use a moderated t-test slightly adapted from the limma package, which in use cases like ours should improve statistical power over a regular t-test. In a nutshell, this is a t-test done independently for each protein, except that the variance used in calculating the t-statistic is moderated using empirical Bayes estimation. After testing, we correct for multiple testing using the Benjamini-Hochberg method in order to keep the FDR under control.
NOTE: lmFit (used in moderated_ttest) was built for log2-transformed data. However, supplying untransformed intensities can also work. This just means that the effects in the linear model are additive on the untransformed scale, whereas for log-transformed data they are multiplicative on the untransformed scale. There may also be a bias arising from biased estimates of the population means in the t-tests, as mean(X) is not equal to exp(mean(log(X))).
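The Benjamini-Hochberg correction itself is available in base R as p.adjust; a minimal sketch on made-up p-values:

```r
# BH-adjusted p-values (q-values) for five hypothetical proteins
p <- c(0.001, 0.01, 0.02, 0.4, 0.9)
p.adjust(p, method = 'BH')
```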
# design matrix as used in ANOVA testing.
design.matrix <- get_design_matrix(referenceCondition, sample.info)
dat.dea <- emptyList(names(dat.norm.summ.w2))
for (i in seq_along(dat.norm.summ.w2)) {
# provide scale so moderated_ttest knows whether you input log2 or raw intensities.
this_scale <- scale.vec[match(names(dat.dea)[i], variant.names)]
d <- column_to_rownames(as.data.frame(dat.norm.summ.w2[[i]]), 'Protein')
dat.dea[[i]] <- moderated_ttest(dat=d, design.matrix, scale=this_scale)
}
For each condition, we now get the fold changes, moderated and unmoderated p-values, moderated and unmoderated q-values (BH-adjusted p-values), and some other details.
colnames(dat.dea[[1]])
## [1] "logFC_0.125" "logFC_0.667" "logFC_1" "t.ord_0.125" "t.ord_0.667"
## [6] "t.ord_1" "t.mod_0.125" "t.mod_0.667" "t.mod_1" "p.ord_0.125"
## [11] "p.ord_0.667" "p.ord_1" "p.mod_0.125" "p.mod_0.667" "p.mod_1"
## [16] "q.ord_0.125" "q.ord_0.667" "q.ord_1" "q.mod_0.125" "q.mod_0.667"
## [21] "q.mod_1" "df.r" "df.0" "s2.0" "s2"
## [26] "s2.post"
Let’s look at how our component variants have affected the outcome of the DEA.
cm <- conf_mat(dat.dea, 'q.mod', 0.05, spiked.proteins)
print_conf_mat(cm)
| | median: background | median: spiked | iPQF: background | iPQF: spiked | sum: background | sum: spiked |
|---|---|---|---|---|---|---|
| not_DE | 4061 | 5 | 4061 | 5 | 4064 | 16 |
| DE | 3 | 14 | 3 | 14 | 0 | 3 |
| | median | iPQF | sum |
|---|---|---|---|
| Accuracy | 0.998 | 0.998 | 0.996 |
| Sensitivity | 0.737 | 0.737 | 0.158 |
| Specificity | 0.999 | 0.999 | 1.000 |
| PPV | 0.824 | 0.824 | 1.000 |
| NPV | 0.999 | 0.999 | 0.996 |
| | median: background | median: spiked | iPQF: background | iPQF: spiked | sum: background | sum: spiked |
|---|---|---|---|---|---|---|
| not_DE | 4064 | 15 | 4063 | 16 | 4064 | 19 |
| DE | 0 | 4 | 1 | 3 | 0 | 0 |
| | median | iPQF | sum |
|---|---|---|---|
| Accuracy | 0.996 | 0.996 | 0.995 |
| Sensitivity | 0.211 | 0.158 | 0.000 |
| Specificity | 1.000 | 1.000 | 1.000 |
| PPV | 1.000 | 0.750 | NaN |
| NPV | 0.996 | 0.996 | 0.995 |
| | median: background | median: spiked | iPQF: background | iPQF: spiked | sum: background | sum: spiked |
|---|---|---|---|---|---|---|
| not_DE | 4059 | 5 | 4060 | 5 | 4064 | 16 |
| DE | 5 | 14 | 4 | 14 | 0 | 3 |
| | median | iPQF | sum |
|---|---|---|---|
| Accuracy | 0.998 | 0.998 | 0.996 |
| Sensitivity | 0.737 | 0.737 | 0.158 |
| Specificity | 0.999 | 0.999 | 1.000 |
| PPV | 0.737 | 0.778 | 1.000 |
| NPV | 0.999 | 0.999 | 0.996 |
To see whether the three Summarization methods produce similar results at the detailed level of individual proteins, we make scatter plots and check the correlation of their fold changes and q-values.
For all conditions, we see that iPQF consistently produces slightly higher q-values in the \([0, 0.5]\) interval than the Median summarization, with which it correlates \(>0.82\) overall. Towards \(q=1\) the correlation is worse, but that is not surprising, as that is the domain of housekeeping proteins, which moreover are not of particular interest. The Sum summarization produced very few significant results.
scatterplot_ils(dat.dea, q.cols, 'q-values')
The correlation between Median summarization and iPQF is even higher for the fold changes: \(>0.96\) for all conditions. This is to be expected given the q-value plots, as even a small difference in fold change can produce a large difference in the test statistic. The plots involving Sum summarization show an anomaly around zero fold change.
scatterplot_ils(dat.dea, logFC.cols, 'log2FC')
The volcano plot combines information on fold changes and statistical significance. The spike-in proteins are colored blue, and it is immediately clear that their fold changes dominate the region of statistical significance, which suggests the experiment and analysis were carried out successfully. Note that the plots are quite narrow along the fold-change axis, a consequence of applying CONSTANd on the log2 scale.
for (i in 1:n.contrasts){
volcanoplot_ils(dat.dea, i, spiked.proteins)
}
On a ‘macroscopic’ scale, the best way to find out whether this experiment and subsequent analysis was successful is to check whether the spike-in proteins have attained the fold change that corresponds to their condition. Instead of plotting each protein individually, we make violin plots to get a sense of the general trend. The theoretical fold changes are indicated with red dotted lines.
Clearly, the empirical results tend towards the theoretical truth; however, not a single observation attained the fold change it should have. There is a clear, strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here. Median summarization and iPQF produce very similar violins, while Sum summarization is again the odd one out.
# plot theoretical value (horizontal lines) and violin per condition
dat.spiked.logfc <- lapply(dat.dea, function(x) x[spiked.proteins,logFC.cols])
dat.spiked.logfc.l <- lapply(dat.spiked.logfc, function(x) {
x %>% rename_with(function(y) sapply(y, function(z) strsplit(z, '_')[[1]][2])) %>% pivot_longer(cols = everything(), names_to = 'condition', values_to = 'logFC') %>% add_column(Protein=rep(rownames(x), each=length(colnames(x)))) })
violinplot_ils(lapply(dat.spiked.logfc.l, filter, condition != referenceCondition))
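For reference, the theoretical log2 fold changes (the red dotted lines) w.r.t. the 0.5 reference follow directly from the dilution factors:

```r
# theoretical log2 fold changes of each dilution condition vs the 0.5 reference
dilutions <- c(0.125, 0.667, 1)
log2(dilutions / 0.5)  # -2, ~0.416, 1
```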
For the given data set, the differences in proteomic outcomes between Median and iPQF summarization are quite small, both on the global and the individual scale. Their differences with the Sum summarization outcomes are very large: the QC plots suggest that Sum summarization obfuscates any good work done by the normalization.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_BE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_BE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_BE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] CONSTANd_0.99.4 forcats_0.5.0 stringr_1.4.0
## [4] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0
## [7] tidyr_1.1.2 tibble_3.0.4 tidyverse_1.3.0
## [10] MSnbase_2.15.7 ProtGenerics_1.21.0 S4Vectors_0.27.14
## [13] mzR_2.23.1 Rcpp_1.0.5 Biobase_2.49.1
## [16] BiocGenerics_0.35.4 psych_2.0.9 limma_3.45.19
## [19] kableExtra_1.3.1 dendextend_1.14.0 gridExtra_2.3
## [22] stringi_1.5.3 ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.4-1 ellipsis_0.3.1 class_7.3-17
## [4] fs_1.5.0 rstudioapi_0.11 farver_2.0.3
## [7] affyio_1.59.0 prodlim_2019.11.13 fansi_0.4.1
## [10] lubridate_1.7.9 xml2_1.3.2 codetools_0.2-16
## [13] splines_4.0.3 ncdf4_1.17 mnormt_2.0.2
## [16] doParallel_1.0.16 impute_1.63.0 knitr_1.30
## [19] jsonlite_1.7.1 pROC_1.16.2 caret_6.0-86
## [22] broom_0.7.2 vsn_3.57.0 dbplyr_1.4.4
## [25] BiocManager_1.30.10 compiler_4.0.3 httr_1.4.2
## [28] backports_1.1.10 assertthat_0.2.1 Matrix_1.2-18
## [31] cli_2.1.0 htmltools_0.5.0 tools_4.0.3
## [34] gtable_0.3.0 glue_1.4.2 affy_1.67.1
## [37] reshape2_1.4.4 MALDIquant_1.19.3 cellranger_1.1.0
## [40] vctrs_0.3.4 preprocessCore_1.51.0 nlme_3.1-150
## [43] iterators_1.0.13 timeDate_3043.102 xfun_0.18
## [46] gower_0.2.2 rvest_0.3.6 lifecycle_0.2.0
## [49] XML_3.99-0.5 zlibbioc_1.35.0 MASS_7.3-53
## [52] scales_1.1.1 ipred_0.9-9 pcaMethods_1.81.0
## [55] hms_0.5.3 yaml_2.2.1 rpart_4.1-15
## [58] highr_0.8 foreach_1.5.1 e1071_1.7-4
## [61] BiocParallel_1.23.3 lava_1.6.8 rlang_0.4.8
## [64] pkgconfig_2.0.3 mzID_1.27.0 evaluate_0.14
## [67] lattice_0.20-41 recipes_0.1.14 labeling_0.4.2
## [70] tidyselect_1.1.0 plyr_1.8.6 magrittr_1.5
## [73] R6_2.4.1 IRanges_2.23.10 generics_0.0.2
## [76] DBI_1.1.0 pillar_1.4.6 haven_2.3.1
## [79] withr_2.3.0 mgcv_1.8-33 nnet_7.3-14
## [82] survival_3.2-7 modelr_0.1.8 crayon_1.3.4
## [85] utf8_1.1.4 tmvnsim_1.0-2 rmarkdown_2.5
## [88] viridis_0.5.1 grid_4.0.3 readxl_1.3.1
## [91] data.table_1.13.2 blob_1.2.1 ModelMetrics_1.2.2.2
## [94] reprex_0.3.0 digest_0.6.27 webshot_0.5.2
## [97] munsell_0.5.0 viridisLite_0.3.0